-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
checkstate: make tests more robust #249
Conversation
The checks manager uses contexts to cancel check runners. This happens during checkstate.PlanChanged() when the previous checkers are all terminated, and the new checker definition is applied. However, context cancellation is asynchronous (as highlighted in the context documentation), so the check runners will continue to complete even after the context cancellations returned. Due to this race condition, the runner can still launch a check even after the context cancellation is complete. This problem is reproduced here: https://gist.github.com/flotter/f6bf32d7cc33cb1d53a3ddc39fc3b64c The result of this is easily seen the following unit test failure: panic: internal error: reaper must be started goroutine 75 [running]: github.com/canonical/pebble/internals/reaper.StartCommand /home/flotter/pebble/internals/reaper/reaper.go:164 github.com/canonical/pebble/internals/overlord/checkstate.(*execChecker).check() /home/flotter/pebble/internals/overlord/checkstate/checkers.go:170 github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).runCheck() /home/flotter/pebble/internals/overlord/checkstate/manager.go:199 github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).loop() /home/flotter/pebble/internals/overlord/checkstate/manager.go:182 created by github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged /home/flotter/pebble/internals/overlord/checkstate/manager.go:74 In this particular case, the problem is that test teardown stops the reaper process which is responsible for reaping check commands, while the final (edge case) run of the check is still completing. Following the same logic, this could also cause check overlap between terminating previous checks and starting new checks in checkstate.PlanChanged(). The following changes fix the problem: - Use a WaitGroup to represent the group of active checks in the check manager - Wait() on the group to complete before creating the next group of checks when PlanChanged is called. - Each check runner will use a deferred exit to decrement the group counter.
The following changes are made: - Remove a check which depends on check output in TestTimeout as there is no way to remove the race between the 'echo FOO' completing before the timeout is triggered. This failure can easily be reproduced by running the test under heavy cpu load (i.e. stress --cpu 10 --io 4) manager_test.go:140: c.Assert(check.ErrorDetails, Equals, "FOO") ... obtained string = "" ... expected string = "FOO" - Make waitCheck() more conservative when waiting for a check runner iteration to complete. The worse case timeout can safely be much longer as this does not have a general impact on test duration. This failure can easily be reproduced by running the test under heavy cpu load (i.e. stress --cpu 10 --io 4) manager_test.go:134: check := waitCheck(c, mgr, "chk1", func(check *CheckInfo) bool { return check.Status != CheckStatusUp }) manager_test.go:283: c.Fatalf("timed out waiting for check %q", name) ... Error: timed out waiting for check "chk1"
@@ -45,12 +45,14 @@ func (s *ManagerSuite) SetUpSuite(c *C) { | |||
setLoggerOnce.Do(func() { | |||
logger.SetLogger(logger.New(os.Stderr, "[test] ")) | |||
}) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved reaper.Start() and reaper.Stop() to test level. The command cleanup should happen before the next test starts, and it also provides a more robust test environment for race conditions, as now each test must cleanly exit before the reaper is stopped.
Note that the reaper is bound to the package level test binary process ID, and therefore moving it to the test setup and teardown make it incompatible for parallel testing. However, parallel testing is strictly opt-in, so this is still a valid requirement for this particular package tests.
@@ -161,17 +162,15 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) { | |||
}, | |||
}) | |||
|
|||
// Wait for command to start (output file grows in size) | |||
prevSize := 0 | |||
// Wait for command to start (output file is not zero in size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous logic did not make 100% sense, so I simplified it a bit.
c.MkDir() is test managed and removed after the test. The tempfile is therefore always zero to start with, and so we can simply check for a non-zero value. In the original code, the prevSize = len(b)
appeared redundant as the previous check would break out before it could happen for any non-zero value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is really good work and detailed analysis. Much appreciated!
In terms of review: approved, but with a couple of very small suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At last! Thanks for the great work
@@ -137,7 +139,6 @@ func (s *ManagerSuite) TestTimeout(c *C) { | |||
c.Assert(check.Failures, Equals, 1) | |||
c.Assert(check.Threshold, Equals, 1) | |||
c.Assert(check.LastError, Equals, "exec check timed out") | |||
c.Assert(check.ErrorDetails, Equals, "FOO") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As explained in the commit message, this check is too racey. There is no way there is any guarantee this would be done before the timeout cuts off the command execution. Given that this test focuses on the timeout mechanism, and not command output logging, I feel it is justified to remove this to improve test robustness.
@@ -185,7 +184,6 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) { | |||
stopChecks(c, mgr) | |||
|
|||
// Ensure command was terminated (output file didn't grow in size) | |||
time.Sleep(50 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stopChecks() calls PlanChanged(), which terminates the previous checks synchronously now, so this is not needed.
Small stability improvements, no API changes -- going to merge this without further review. We can always tweak in subsequent PRs if needed. |
checkstate: make check runners exit cleanly
The checks manager uses contexts to cancel check runners. This happens
during checkstate.PlanChanged() when the previous checkers are all
terminated, and the new checker definition is applied.
However, context cancellation is asynchronous (as highlighted in the
context documentation), so the check runners will continue to complete
even after the context cancellations returned.
Due to this race condition, the runner can still launch a check even after
the context cancellation is complete.
This problem is reproduced here:
https://gist.github.com/flotter/f6bf32d7cc33cb1d53a3ddc39fc3b64c
The result of this is easily seen the following unit test failure:
panic: internal error: reaper must be started
goroutine 75 [running]:
github.com/canonical/pebble/internals/reaper.StartCommand
/home/flotter/pebble/internals/reaper/reaper.go:164
github.com/canonical/pebble/internals/overlord/checkstate.(*execChecker).check()
/home/flotter/pebble/internals/overlord/checkstate/checkers.go:170
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).runCheck()
/home/flotter/pebble/internals/overlord/checkstate/manager.go:199
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).loop()
/home/flotter/pebble/internals/overlord/checkstate/manager.go:182
created by github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged
/home/flotter/pebble/internals/overlord/checkstate/manager.go:74
In this particular case, the problem is that test teardown stops the reaper
process which is responsible for reaping check commands, while the final
(edge case) run of the check is still completing.
Following the same logic, this could also cause check overlap between terminating
previous checks and starting new checks in checkstate.PlanChanged().
The following changes fix the problem:
Use a WaitGroup to represent the group of active checks in the check manager
Wait() on the group to complete before creating the next group of checks
when PlanChanged is called.
Each check runner will use a deferred exit to decrement the group counter.
checkstate: make tests more robust
The following changes are made:
Remove a check which depends on check output in TestTimeout as
there is no way to remove the race between the 'echo FOO'
completing before the timeout is triggered, without significant rework.
This failure can easily be reproduced by running the test under
heavy cpu load (i.e. stress --cpu 10 --io 4)
manager_test.go:140:
c.Assert(check.ErrorDetails, Equals, "FOO")
... obtained string = ""
... expected string = "FOO"
Make waitCheck() more conservative when waiting for a check
runner iteration to complete. The worse case timeout can
safely be much longer as this does not have a general impact
on test duration.
This failure can easily be reproduced by running the test under
heavy cpu load (i.e. stress --cpu 10 --io 4)
manager_test.go:134:
check := waitCheck(c, mgr, "chk1", func(check *CheckInfo) bool {
return check.Status != CheckStatusUp
})
manager_test.go:283:
c.Fatalf("timed out waiting for check %q", name)
... Error: timed out waiting for check "chk1"